> Red Wine Dataset: This dataset holds information on 1599 red wines of the Portuguese “Vinho Verde” type. The inputs are objective tests (e.g. pH, fixed acidity, residual sugar) and the output is based on sensory data (wine quality between 0 and 10, graded by experts). We will use this dataset to find out how the input variables are related to the quality of the wine. First, we will visualize the input variables to understand how they are distributed. Then, we will move on to a bivariate analysis to see how the input variables are correlated with each other and which of them are correlated with the wine quality. Furthermore, we will draw multivariate plots that combine different variables and visualize their relation to the wine quality. Finally, we will use the knowledge gained to set up a model that predicts the quality of a wine from the given input variables.
> Dataset Variables:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
> Dataset Summary:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
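As a minimal sketch of how the two outputs above could be produced, the block below rebuilds only the first ten rows printed by `str()` for a few of the columns and then calls `str()` and `summary()` on them. The name `wine_head` is hypothetical, and because it holds ten rows rather than all 1599, its summary statistics will not match the table above.

```r
# First ten values of a few columns, copied from the str() output above.
# `wine_head` is a hypothetical name; the full dataset has 1599 rows.
wine_head <- data.frame(
  X                = 1:10,
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# X is only a row index, so drop it before summarising.
wine_head$X <- NULL

str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, mean and max per column
```

Dropping `X` first keeps the row index out of the summary, where its quartiles carry no information about the wines.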
> Quality: This bar chart shows the distribution of quality. Most of the wines were rated 5 or 6 points out of 10 by the experts.
> Alcohol: The distribution of alcohol has a slight positive skew.
> Volatile Acidity: The volatile acidity is distributed around the mean of 0.53 g/dm^3.
> Sulphates: We can see that there are a few outliers in the sulphates distribution. We might keep this in mind for further explorations.
> Fixed Acidity: The fixed acidity is distributed around the mean of 8.32 g/dm^3.
> Citric Acid: The distribution of citric acid appears roughly uniform. However, some values clearly stand out: 132 wines do not contain any citric acid, another 68 wines have a citric acid content of 0.49 g/dm^3, and a single wine has a citric acid content of 1 g/dm^3.
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
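A frequency table like the one above is what `table()` returns for a numeric column. The sketch below applies it to the first ten citric acid values from the `str()` output (so the counts differ from the full table) and then uses the counts printed above to express the two spikes as shares of all 1599 wines. The variable names are hypothetical.

```r
# First ten citric acid values, copied from the str() output earlier in the report.
citric_head <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# table() counts how often each distinct value occurs.
citric_counts <- table(citric_head)
print(citric_counts)

# Shares of the two spikes, using the counts printed in the full table above.
n_wines <- 1599
spike_share <- c(zero = 132, at_049 = 68) / n_wines
print(round(100 * spike_share, 1))  # percentages of all wines
```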
> Residual Sugar: The distribution of residual sugar is concentrated at low values, with a few outliers on the right side of the plot. This may also be interesting for further exploration.
> Chlorides: The chlorides distribution is likewise concentrated on the left side of the plot.
> Free Sulfur Dioxide: The free sulfur dioxide variable has a strong positive skew on a linear axis, which is why I log-transformed the x-axis. The log-transformed distribution resembles a normal distribution.
> Total Sulfur Dioxide: The total sulfur dioxide distribution has been log-transformed as well.
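To illustrate why a log transform helps with a right-skewed variable, the sketch below compares a simple moment-based skewness before and after `log10()`. It uses only the first ten total sulfur dioxide values from the `str()` output, so the numbers are illustrative rather than the report's. The `skewness` helper is hypothetical, and whether the report transformed the data or only the plot axis is not stated.

```r
# First ten total sulfur dioxide values (mg/dm^3), copied from the str() output.
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Simple moment-based skewness (hypothetical helper, not from the report).
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(total_so2)
skew_log <- skewness(log10(total_so2))

cat("skewness, linear scale:", round(skew_raw, 2), "\n")
cat("skewness, log10 scale: ", round(skew_log, 2), "\n")

# On the full data, hist(log10(df$total.sulfur.dioxide)) would draw the
# transformed distribution; `df` is the data frame name used in the lm() calls.
```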
> Density: The density curve is distributed around its mean of 0.997 g/cm^3.
> pH: The average pH value of the red wines in our dataset is 3.3. The pH values are approximately normally distributed.
There are 1599 red wines in the dataset with 11 input variables and one output variable; the remaining column, X, is just a row index. All variables are numerical.
Input variables:
- fixed acidity (tartaric acid - g / dm^3)
- volatile acidity (acetic acid - g / dm^3)
- citric acid (g / dm^3)
- residual sugar (g / dm^3)
- chlorides (sodium chloride - g / dm^3)
- free sulfur dioxide (mg / dm^3)
- total sulfur dioxide (mg / dm^3)
- density (g / cm^3)
- pH
- sulphates (potassium sulphate - g / dm^3)
- alcohol (% by volume)
Output variable:
- quality (score between 0 and 10)
The main feature in the dataset is the wine quality. I suspect that there are correlations between the input variables and the wine quality.
Every input variable is of some interest for exploration. However, I expect alcohol to be particularly interesting, since the alcoholic strength might have had an impact on the experts' ratings. :)
I did not create any new variables.
I log-transformed the right-skewed free sulfur dioxide and total sulfur dioxide distributions. The transformed distributions appear approximately normal, with free sulfur dioxide peaking around 10 mg / dm^3 and total sulfur dioxide peaking around 40 mg / dm^3.
Moreover, we can clearly see that the distributions of residual sugar, chlorides and sulphates are concentrated at low values, with only a few high-valued outliers. I was also surprised by the quality ratings, as I expected them to spread over a wider range (an interquartile range of 3 to 8 rather than the observed 5 to 6).
The following matrices show the correlations between the variables:
> Approach: At first, we will have a look at the variables which are strongly correlated with the quality of the wines. The top 3 features are:
- alcohol (correlation: 0.476)
- volatile acidity (correlation: -0.391)
- sulphates (correlation: 0.251)
Then we will determine the strongest correlations amongst all variables.
- citric acid vs fixed acidity (correlation: 0.672)
- pH vs fixed acidity (correlation: -0.683)
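The ranking of correlations described above can be sketched with `cor()`. The helper `rank_correlations` below is hypothetical, and the data frame holds only the first ten rows from the `str()` output, so the printed values will differ from the full-data correlations quoted in the lists above.

```r
# Rank input variables by the absolute Pearson correlation with a target column.
# `rank_correlations` is a hypothetical helper, not a function from the report.
rank_correlations <- function(data, target) {
  inputs <- setdiff(names(data), target)
  r <- sapply(inputs, function(v) cor(data[[v]], data[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Ten-row illustration built from the str() output; results differ from the
# full-data correlations quoted in the text.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

ranked <- rank_correlations(wine_head, "quality")
print(round(ranked, 3))
```

Sorting by absolute value keeps strong negative correlations, such as the one for volatile acidity, near the top of the ranking.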
> Bivariate scatterplot quality vs alcohol: The wine quality is positively correlated with the alcohol strength (r = 0.476). The boxplots on the right-hand side show a clear upward trend in alcohol for quality ratings between 5 and 8.
> Bivariate scatterplot quality vs volatile acidity: Volatile acidity has the strongest negative correlation with the wine quality. Both the linear regression and the boxplots overlaying the scatterplots emphasize this relationship.
> Bivariate scatterplot quality vs sulphates: The sulphates variable has the third strongest correlation with the quality variable. With regard to the univariate analysis of sulphates we can now see that most of the outliers are centered around middle quality ratings of 5-6. This is not surprising as most of the wines in this dataset lie in this range.
> Bivariate scatterplots fixed acidity vs citric acid and fixed acidity vs pH: On the left we can see the strong positive relationship between fixed acidity and citric acid. This makes sense, as citric acid is one of the fixed acids in wine, along with malic and tartaric acid.
On the right we can see a strong negative correlation between fixed acidity and pH. This should not surprise us either: a lower pH value means a more acidic liquid, and vice versa.
> Bivariate scatterplots fixed acidity vs density and alcohol vs density: On the left we see the positive correlation between fixed acidity and density. The scatterplot on the right shows the negative relationship between alcohol and density.
The wine quality is positively correlated with the alcohol strength (correlation: 0.476). Furthermore, the quality is negatively correlated with the volatile acidity (correlation: -0.391). The third most important driver of the wine quality is the sulphates parameter (correlation: 0.251).
The strongest relationships among the input variables were between fixed acidity and citric acid as well as between fixed acidity and pH values. These data patterns confirm our basic knowledge about chemistry.
Strong, interesting relationships which I would not have expected are between:
- fixed acidity and density (correlation: 0.668)
- density and alcohol (correlation: -0.496)
The strongest correlation was between fixed acidity and pH (correlation: -0.683).
> Wine quality by alcohol and volatile acidity: We have introduced a new grouping of the quality rating into three levels: Bad, Middle and Good. This makes it easier to identify patterns in the data.
The first multivariate plot shows the wine quality by alcohol and volatile acidity, the two variables with the strongest correlations with the wine quality. We can see that the wine quality increases towards the upper left corner of the scatterplot.
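The Bad / Middle / Good grouping could be built with `cut()`, as in the sketch below. The cut points used here (4 and below, 5 to 6, 7 and above) are an assumption, since the report does not state where the boundaries were drawn, and `quality_level` is a hypothetical helper name.

```r
# Group integer quality scores into three ordered levels.
# The cut points (<= 4, 5-6, >= 7) are an assumption; the report does not state them.
quality_level <- function(quality) {
  cut(quality,
      breaks = c(-Inf, 4, 6, Inf),
      labels = c("Bad", "Middle", "Good"),
      ordered_result = TRUE)
}

scores <- 3:8  # range of quality scores observed in the dataset summary
levels_for_scores <- quality_level(scores)
print(data.frame(quality = scores, level = levels_for_scores))
```

An ordered factor keeps the levels in Bad < Middle < Good order in plot legends instead of sorting them alphabetically.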
> Wine quality by alcohol and sulphates: Here we can see how the wine quality changes for different alcohol strengths and sulphates inputs. The better wines seem to have input values in the upper right corner of the scatterplot.
> Wine quality by volatile acidity and sulphates: This scatterplot once again shows the negative correlation between volatile acidity and wine quality. Wines with low sulphates values and a high volatile acidity seem to get bad ratings.
> Wine quality by citric acid and fixed acidity: In this plot it is not possible to determine a clear pattern.
> Wine quality by pH and fixed acidity: This multivariate plot does not show a clear pattern either.
> Wine quality by density and fixed acidity: This plot shows a strong positive correlation between fixed acidity and density. For a given fixed acidity, it makes sense to aim for a low density.
> Wine quality by density and alcohol: This plot suggests that the alcohol strength should be at least 10%, as the wines on the right side of the plot are rated better than the rest.
> Prediction Model:
- output variable: quality
- input variables: alcohol, volatile acidity, sulphates, total sulfur dioxide, citric acid
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = df)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = df)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
## data = df)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide, data = df)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## total.sulfur.dioxide + citric.acid, data = df)
##
## ==============================================================================================
##                             m1            m2            m3            m4            m5
## ----------------------------------------------------------------------------------------------
## (Intercept)                 1.875***      3.095***      2.611***      2.826***      2.843***
##                            (0.175)       (0.184)       (0.196)       (0.201)       (0.205)
## alcohol                     0.361***      0.314***      0.309***      0.295***      0.295***
##                            (0.017)       (0.016)       (0.016)       (0.016)       (0.016)
## volatile.acidity                         -1.384***     -1.221***     -1.199***     -1.222***
##                                          (0.095)       (0.097)       (0.097)       (0.112)
## sulphates                                               0.679***      0.712***      0.721***
##                                                        (0.101)       (0.101)       (0.103)
## total.sulfur.dioxide                                                 -0.002***     -0.002***
##                                                                      (0.001)       (0.001)
## citric.acid                                                                        -0.043
##                                                                                    (0.104)
## ----------------------------------------------------------------------------------------------
## R-squared                   0.227         0.317         0.336         0.344         0.344
## adj. R-squared              0.226         0.316         0.335         0.342         0.342
## sigma                       0.710         0.668         0.659         0.655         0.655
## F                         468.267       370.379       268.912       208.768       166.962
## p                           0.000         0.000         0.000         0.000         0.000
## Log-likelihood          -1721.057     -1621.814     -1599.384     -1589.835     -1589.749
## Deviance                  805.870       711.796       692.105       683.887       683.814
## AIC                      3448.114      3251.628      3208.768      3191.669      3193.499
## BIC                      3464.245      3273.136      3235.654      3223.932      3231.138
## N                        1599          1599          1599          1599          1599
## ==============================================================================================
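As a worked check of the model comparison, the AIC and BIC rows can be recomputed from the log-likelihood row using AIC = -2 logL + 2k and BIC = -2 logL + k log(n). Here k counts the intercept, the slopes and the residual variance, and n = 1599. The sketch below copies only numbers printed in the table; the variable names are hypothetical.

```r
# Log-likelihoods and sample size copied from the comparison table above.
log_lik <- c(m1 = -1721.057, m2 = -1621.814, m3 = -1599.384,
             m4 = -1589.835, m5 = -1589.749)
n_obs <- 1599

# Estimated parameters per model: intercept + slopes + residual variance.
n_par <- c(m1 = 3, m2 = 4, m3 = 5, m4 = 6, m5 = 7)

aic <- -2 * log_lik + 2 * n_par
bic <- -2 * log_lik + log(n_obs) * n_par

print(round(rbind(AIC = aic, BIC = bic), 3))
cat("lowest AIC:", names(which.min(aic)), "\n")
cat("lowest BIC:", names(which.min(bic)), "\n")
```

Both criteria come out lowest for m4, which agrees with the non-significant citric.acid coefficient in m5: the extra variable does not pay for the added parameter.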
> Prediction: The following input values were used to predict the wine quality:
- Alcohol: 12%
- Volatile Acidity: 0.3 g / dm^3
- Sulphates: 0.75 g / dm^3
- Total Sulfur Dioxide: 25 mg / dm^3
- One standard deviation confidence interval
## fit lwr upr
## 1 6.340534 5.688252 6.992816
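An output with `fit`, `lwr` and `upr` columns is what `predict()` returns for an `lm` object when an interval is requested. The sketch below shows the call on a two-predictor model fitted to the first ten rows from the `str()` output, so its numbers will not match the interval above. Using `interval = "prediction"` with `level = 0.68` as a stand-in for "one standard deviation" is an assumption about how the report computed it.

```r
# Ten-row illustration built from the str() output; the report fits on all
# 1599 rows, so the numbers printed here will not match the interval above.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Two-predictor model, kept small because only ten rows are available here.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine_head)

new_wine <- data.frame(alcohol = 12, volatile.acidity = 0.3)

# level = 0.68 approximates "one standard deviation"; both the level and the
# choice of a prediction interval are assumptions, not taken from the report.
pred <- predict(fit, newdata = new_wine, interval = "prediction", level = 0.68)
print(pred)
```

A prediction interval describes where a single new wine's rating might fall. A confidence interval for the mean rating would be much narrower.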
The multivariate plots helped to understand how good wines are different in terms of input values compared to the other wines in the dataset. Some plots enabled us to see clear patterns e.g. the plots ‘Wine quality by alcohol and volatile acidity’, ‘Wine quality by alcohol and sulphates’, ‘Wine quality by volatile acidity and sulphates’ and ‘Wine quality by density and alcohol’. If we tried to select appropriate features for further machine learning activities, these visualizations would help us in understanding the data and selecting the best features.
For me, the surprising plots were ‘Wine quality by density and fixed acidity’ and ‘Wine quality by density and alcohol’. I did not expect a strong relationship between these input variables and was even more surprised by the clarity of the patterns in the data.
I created a prediction model. For this purpose, I selected the quality as the output variable and the variables that are correlated with the wine quality as inputs for a linear regression model. I fitted five different models, but the last model did not show any improvement in terms of R-squared. The maximum R-squared is 0.344, which means the model explains only about a third of the variance in quality, so the predictions are not very precise. Still, we can use this model to get a feeling for the range (e.g. a one standard deviation interval) in which the quality of a wine might lie given the input variables alcohol, volatile acidity, sulphates and total sulfur dioxide.
These bivariate plots illustrate the correlation between the alcohol strength and the wine quality. Both the linear regression line on the left hand and the boxplots on the right hand show that these variables are positively correlated. This means that wines with a higher alcohol strength tend to have a higher quality.
This multivariate plot examines how the quality is related to the alcohol strength and the sulphates content. It shows that both alcohol and sulphates are positively correlated with the wine quality, which results in a higher density of good wines in the upper right corner.
This plot shows a strong positive correlation between fixed acidity and density. For a given fixed acidity, it makes sense to aim for a low density. Conversely, for a given wine density, it makes sense to produce more acidic wines, as those tend to receive better ratings.
In this project we have explored a dataset of 1599 red wines. We have looked at the variables themselves and at how they are correlated with each other, paying particular attention to the input variables that are most strongly correlated with the wine quality. We illustrated these correlations in the bivariate plots section and extended them in the multivariate plots section. My main struggle in the beginning was that most wines are rated in the very small range of 5-6, which made it difficult to create effective multivariate plots. We solved this problem by grouping the integer quality values into bad, middle and good wines.
This dataset might also be interesting for machine learning applications. My regression model was based on a very simple linear relationship between the variables. I expect that the performance of the prediction model could be improved by selecting the best features and using more sophisticated machine learning algorithms, e.g. decision trees or support vector regression. The multivariate plots suggested that promising decision surfaces can be built from the input variables.